We examine the following general questions to see whether they can be answered:
Here are three main reasons to select CitiBike trip data:
For the taxi data, drop-off locations better represent the actual demand for trips than pickup locations. This hypothesis is reasonable because demand originates where people want to go: riders end up close to their destination at the drop-off spot, whereas the pickup spot only records where they managed to hail a taxi, which may be far from where they first wanted one (or they may fail to get one at all).
To render this report, run the following command in a terminal:
Rscript -e "rmarkdown::render('FinalProject_FDS.Rmd')"
WORKDIR <- getwd()
FIGDIR <- file.path(WORKDIR, 'fig')
SRCDIR <- file.path(WORKDIR, 'src')
DATADIR <- file.path(WORKDIR, 'data')
if (!dir.exists(FIGDIR)) dir.create(FIGDIR)
if (!dir.exists(SRCDIR)) dir.create(SRCDIR)
if (!dir.exists(DATADIR)) dir.create(DATADIR)
library('dplyr')
library('reshape2')
library('ggplot2')
library('lubridate')
library('readr')
library('scales')
library('cowplot')
library('forecast')
library('pheatmap')
library('ggmap')
library('igraph')
library('NbClust')
#theme_set(theme_bw(base_size = 12))
myGthm <- theme(text = element_text(size = 15),
legend.position='bottom')
The CitiBike data source is https://s3.amazonaws.com/tripdata/index.html. The NYC Taxi data source is http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. Bash scripts are provided to download and unzip the data.
If you would like to run from scratch, please run the following commands in a terminal:
# go to work-directory of project (same as this RMD file)
cd .
# create data folder
mkdir -p data
cd ./src
# You may want to modify the sh files to include/exclude data files
sh get_data.sh
sh get_data_taxi.sh
If successful, the CitiBike trip data set is available in the data folder to be read in.
infile <- list.files(path=file.path(DATADIR), pattern='*-citibike-tripdata.csv',
full.names = T)
data <- bind_rows(lapply(infile, function(x) {
read.csv(x, stringsAsFactors=FALSE)}))
use.RDS.for.taxi <- TRUE
RDS.name.for.taxi <- 'taxi.RDS'
trip.data <-
if(use.RDS.for.taxi){
readRDS(file.path(DATADIR, RDS.name.for.taxi))
}else{
infile.taxi <- list.files(path=file.path(DATADIR), pattern='\\S*_tripdata_\\d{4}-\\d{2}\\.csv',
full.names = T)
bind_rows(lapply(infile.taxi, function(x) {
read.csv(x, stringsAsFactors=FALSE)}))
}
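The cache behind `use.RDS.for.taxi` can be produced once with `saveRDS()`, so later renders take the fast path. A self-contained sketch using a temporary directory and a stand-in data frame (the real call would use the project's `trip.data` and `DATADIR`):

```r
DATADIR <- tempdir()                      # stand-in for the project's data folder
RDS.name.for.taxi <- 'taxi.RDS'
trip.data <- data.frame(Fare_amount = c(19.5, 11.5))  # stand-in for the CSV import
saveRDS(trip.data, file.path(DATADIR, RDS.name.for.taxi))   # write the cache once
reloaded <- readRDS(file.path(DATADIR, RDS.name.for.taxi))  # fast path on later runs
identical(reloaded, trip.data)  # TRUE
```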
head(trip.data)
## VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag
## 1 2 2016-06-01 02:46:38 2016-06-01 03:06:40 N
## 2 2 2016-06-01 02:55:26 2016-06-01 03:06:52 N
## 3 2 2016-06-01 02:50:36 2016-06-01 03:08:39 N
## 4 2 2016-06-01 02:57:04 2016-06-01 03:07:52 N
## 5 2 2016-06-01 02:52:03 2016-06-01 03:08:12 N
## 6 2 2016-06-01 02:59:03 2016-06-01 03:09:25 N
## RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude
## 1 1 -73.93058 40.69518 -74.00005
## 2 1 -73.94693 40.79255 -73.95157
## 3 1 -73.94453 40.82396 -73.99466
## 4 1 -73.95221 40.82387 -73.91436
## 5 1 -73.95798 40.71783 -73.95402
## 6 1 -73.96532 40.71103 -73.98969
## Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax
## 1 40.72905 1 5.24 19.5 0.5 0.5
## 2 40.82516 1 3.14 11.5 0.5 0.5
## 3 40.75042 1 7.50 23.5 0.5 0.5
## 4 40.81470 1 2.27 10.5 0.5 0.5
## 5 40.65512 3 4.90 16.5 0.5 0.5
## 6 40.71417 1 2.76 11.0 0.5 0.5
## Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount
## 1 6.24 0 NA 0.3 27.04
## 2 2.56 0 NA 0.3 15.36
## 3 2.00 0 NA 0.3 26.80
## 4 0.00 0 NA 0.3 11.80
## 5 0.00 0 NA 0.3 17.80
## 6 2.46 0 NA 0.3 14.76
## Payment_type Trip_type
## 1 1 1
## 2 1 1
## 3 1 1
## 4 2 1
## 5 1 1
## 6 1 1
Alternatively, an attached RDS file contains exactly the same data as above. Load the RDS file to skip the download and save time on importing.
data <- readRDS(file.path(DATADIR, 'citibike.rds'))
# data <- read.csv(infile[1], stringsAsFactors = F)
# data <- sample_frac(data, 0.1)
print(head(data))
## tripduration starttime stoptime start.station.id
## 1 1346 1/1/2015 0:01 1/1/2015 0:24 455
## 2 363 1/1/2015 0:02 1/1/2015 0:08 434
## 3 346 1/1/2015 0:04 1/1/2015 0:10 491
## 4 182 1/1/2015 0:04 1/1/2015 0:07 384
## 5 969 1/1/2015 0:05 1/1/2015 0:21 474
## 6 496 1/1/2015 0:07 1/1/2015 0:15 512
## start.station.name start.station.latitude start.station.longitude
## 1 1 Ave & E 44 St 40.75002 -73.96905
## 2 9 Ave & W 18 St 40.74317 -74.00366
## 3 E 24 St & Park Ave S 40.74096 -73.98602
## 4 Fulton St & Waverly Ave 40.68318 -73.96596
## 5 5 Ave & E 29 St 40.74517 -73.98683
## 6 W 29 St & 9 Ave 40.75007 -73.99839
## end.station.id end.station.name end.station.latitude
## 1 265 Stanton St & Chrystie St 40.72229
## 2 482 W 15 St & 7 Ave 40.73936
## 3 505 6 Ave & W 33 St 40.74901
## 4 399 Lafayette Ave & St James Pl 40.68852
## 5 432 E 7 St & Avenue A 40.72622
## 6 383 Greenwich Ave & Charles St 40.73524
## end.station.longitude bikeid usertype birth.year gender
## 1 -73.99148 18660 Subscriber 1960 2
## 2 -73.99932 16085 Subscriber 1963 1
## 3 -73.98848 20845 Subscriber 1974 1
## 4 -73.96476 19610 Subscriber 1969 1
## 5 -73.98380 20197 Subscriber 1977 1
## 6 -74.00027 20788 Subscriber 1969 2
There are three categories of features: temporal (tripduration, starttime, stoptime), spatial (id, name, latitude, and longitude of the start and end stations), and user/bike (bikeid, usertype, birth.year, gender).
The time information in the raw data should be converted to a date-time format recognized by R.
data$startTimestamp <- as.POSIXct(strptime(data$starttime, '%m/%d/%Y %H:%M',
tz = 'EST5EDT'))
data$stopTimestamp <- as.POSIXct(strptime(data$stoptime, '%m/%d/%Y %H:%M',
tz = 'EST5EDT'))
data$startweekday <- factor(weekdays(data$startTimestamp),
levels= c("Sunday", "Monday","Tuesday",
"Wednesday", "Thursday", "Friday",
"Saturday"))
data$stopweekday <- factor(weekdays(data$stopTimestamp),
levels= c("Sunday", "Monday","Tuesday",
"Wednesday", "Thursday", "Friday",
"Saturday"))
data$startHr <- format(data$startTimestamp, '%H')
data$stopHr <- format(data$stopTimestamp, '%H')
start_lat_min <- min(data$start.station.latitude)
start_lat_max <- max(data$start.station.latitude)
start_lon_min <- min(data$start.station.longitude)
start_lon_max <- max(data$start.station.longitude)
plot_lat_bt <- start_lat_min - 2
plot_lat_up <- start_lat_max + 2
plot_lon_lft <- start_lon_min - 2
plot_lon_rit <- start_lon_max + 2
The results below visualize all available stations on a map.
stations_start <- dplyr::select(data, starts_with('start.station')) %>% unique()
stations_end <- dplyr::select(data, starts_with('end.station')) %>% unique()
colnames(stations_start) <- colnames(stations_end) <- c('STATION_ID',
'STATION_NAME',
'STATION_LAT',
'STATION_LON')
stations <- rbind(stations_start, stations_end) %>% unique()
rownames(stations) <- stations$STATION_NAME
# reorder columns as stations_name should be first to create igraph
stations <- dplyr::select(stations, STATION_NAME, STATION_ID,
STATION_LAT, STATION_LON)
map_stations_loc <- function(df,
plot_lat_bt=38.68034, plot_lat_up=42.77152,
plot_lon_lft=-76.01713, plot_lon_rit=-71.95005){
## show stations on maps
# x is data frame for stations info
q <- qmplot(x=STATION_LON, y=STATION_LAT,
data = df,
maptype = 'toner-lite',
extent = 'device',
zoom = 14,
color=I('red'), alpha = I(.7))
return(q)
}
p_all_stations <- map_stations_loc(stations)
print(p_all_stations)
Each row in the raw data is a trip, so the raw data set is in long format. To create temporal profiles of stations, the long format should be summarized into wide format.
set.seed(1234)
## Long-format to wide-format of citybike
citibike_long2wide <- function(data, activity_type = c('pick', 'dock')){
if (activity_type == 'pick') {
station_24hr_long <- dplyr::select(data, start.station.name, startHr)
} else if (activity_type == 'dock') {
station_24hr_long <- dplyr::select(data, end.station.name, stopHr)
} else {
stop('Error: activity type is either pick or dock.')
}
colnames(station_24hr_long) <- c('NAME', 'HOUR')
station_24hr_long <- group_by(station_24hr_long, NAME, HOUR) %>%
summarise(TRIPS=n())
station_24hr_wide <- dcast(station_24hr_long, NAME~HOUR,
value.var = 'TRIPS', sum, fill=0)
rownames(station_24hr_wide) <- station_24hr_wide$NAME
station_24hr_wide <- station_24hr_wide[, -1]
# print(head(station_24hr_wide))
return(station_24hr_wide)
}
pick_station_24hr <- citibike_long2wide(data=data, activity_type='pick')
dock_station_24hr <- citibike_long2wide(data=data, activity_type='dock')
The original data set is summarized into two large matrices: one for picking and one for docking. Each row is a station and each column is an hour of the day. An example of the wide-format matrix of picking activity:
print(head(pick_station_24hr[, 1:6]))
## 00 01 02 03 04 05
## 1 Ave & E 15 St 46 22 21 8 6 22
## 1 Ave & E 18 St 9 1 0 0 1 13
## 1 Ave & E 30 St 14 3 9 2 1 2
## 1 Ave & E 44 St 5 0 0 0 0 3
## 10 Ave & W 28 St 13 8 12 10 2 16
## 11 Ave & W 27 St 4 1 5 2 2 10
To visualize the matrix, the 20 stations with the most picking activity are selected as an example.
orderbyRowSum <- function(data){
o <- mutate(data, SUM=rowSums(data), row_names=rownames(data)) %>%
dplyr::arrange(desc(SUM), row_names) %>%
dplyr::select(-SUM)
rownames(o) <- o$row_names
o <- dplyr::select(o, -row_names)
return(o)
}
top_N <- 20
pick_station_24hr_desc <- orderbyRowSum(pick_station_24hr)
dock_station_24hr_desc <- orderbyRowSum(dock_station_24hr)
top_pick_station_names <- rownames(pick_station_24hr_desc)[seq_len(top_N)]
top_dock_station_names <- rownames(dock_station_24hr_desc)[seq_len(top_N)]
desc_pick_station_names <- rownames(pick_station_24hr_desc)
desc_dock_station_names <- rownames(dock_station_24hr_desc)
pheatmap(pick_station_24hr_desc[top_pick_station_names, ],
cluster_rows = F, cluster_cols = F,
main='Picking activities of Stations with Top 24-hr Picking Activities')
pheatmap(dock_station_24hr_desc[top_pick_station_names, ],
cluster_rows = F, cluster_cols = F,
main='Docking activities of Stations with Top 24-hr Picking Activities')
Comparing the two matrices side by side shows that stations have different rush hours. Other patterns likely exist as well, but they are not straightforward to read off from the two matrices above. It is therefore worth developing a metric that integrates the two sources of information.
Here is how P/D Index is calculated.
row_norm_byMax <- function(x){
t(apply(x, 1, function(r) {
r/max(r)
}))
}
pd_station_24hr <- (row_norm_byMax(pick_station_24hr)+1) / (row_norm_byMax(dock_station_24hr)+1)
pd_station_24hr <- as.data.frame(pd_station_24hr)
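As a toy sanity check of the P/D formula above (hypothetical hourly counts, not project data): an hour where picks dominate gets an index above 1, a dock-dominated hour falls below 1, and balanced activity stays near 1.

```r
row_norm_byMax <- function(x) t(apply(x, 1, function(r) r / max(r)))
# Two hypothetical stations, two hours of counts
picks <- rbind(A = c(10, 0), B = c(2, 2))
docks <- rbind(A = c(0, 10), B = c(2, 2))
pd <- (row_norm_byMax(picks) + 1) / (row_norm_byMax(docks) + 1)
pd  # A: 2.0 then 0.5 (pick- then dock-dominated); B: 1.0 and 1.0 (balanced)
```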
The intuition behind the P/D index is that a high value suggests one of the following three situations:
The P/D index of the same stations as shown before is displayed in the following heat-map.
pheatmap(pd_station_24hr[top_pick_station_names, ],
cluster_rows = F,
cluster_cols = F,
main='P/D Index of Stations with Top 24-hr Picking Activities')
print('P/D Index has average:')
## [1] "P/D Index has average:"
print(sum(pd_station_24hr)/prod(dim(pd_station_24hr)))
## [1] 1.0313
Now it is much more straightforward to infer possible subgroups of stations. K-means is applied to all of the roughly 300 stations to find hidden subgroups in terms of their temporal profiles.
K <- 10  # number of clusters; this manual choice is discussed in the evaluation section
p_pd_kmeans <- pheatmap(pd_station_24hr,
                        kmeans_k = K,
                        cluster_cols = F,
                        main='P/D Index of Stations')
Therefore, there are at least three types of stations according to their 24-hour P/D index profiles:
rush_hr_morning_lab <- c('06', '07', '08')
rush_hr_evening_lab <- c('16', '17', '18')
# Manually extract clusters of interest
# double-check before continue
# pheatmap(p_pd_kmeans$kmeans$centers, cluster_cols = F)
pd_kmeans_center <- p_pd_kmeans$kmeans$centers
pd_kmeans_cluster <- p_pd_kmeans$kmeans$cluster
## Type-1
stations_type1_names <- names(pd_kmeans_cluster[pd_kmeans_cluster %in% c(5, 9)])
p_station_type1 <- dplyr::filter(stations, STATION_NAME %in% stations_type1_names) %>%
map_stations_loc()
## Type-2
stations_type2_names <- names(pd_kmeans_cluster[pd_kmeans_cluster %in% c(4, 8, 10)])
p_station_type2 <- dplyr::filter(stations, STATION_NAME %in% stations_type2_names) %>%
map_stations_loc()
## Type-3
stations_type3_names <- names(pd_kmeans_cluster[pd_kmeans_cluster %in% c(2, 6)])
p_station_type3 <- dplyr::filter(stations, STATION_NAME %in% stations_type3_names) %>%
map_stations_loc()
## Type-4 unknown: the rest stations
stations_types_summary <- rep('X', NROW(stations))
stations_types_summary[stations$STATION_NAME %in% stations_type1_names] <- LETTERS[1]
stations_types_summary[stations$STATION_NAME %in% stations_type2_names] <- LETTERS[2]
stations_types_summary[stations$STATION_NAME %in% stations_type3_names] <- LETTERS[3]
stations_byTypes <- dplyr::mutate(stations, TYPE=stations_types_summary)
p_stations_byTypes <- qmplot(x=STATION_LON, y=STATION_LAT,
data = stations_byTypes,
maptype = 'toner-lite',
extent = 'device',
zoom = 14,
color=TYPE,
size=I(2.5))
print(p_stations_byTypes)
First, I developed the P/D index to integrate picking and docking activities into a single, straightforward metric reflecting the temporal profile of each station.
Second, by clustering the hourly P/D index matrix of stations, the stations with different hourly activities formed several distinct subgroups. Among them, Type-A stations are likely home-like, Type-B stations are company-like, and Type-C stations are regular stations.
Finally, by grouping stations this way, I can offer suggestions to the CitiBike company for better manual rebalancing of bikes among stations.
Stations have different “rush hours”; it is therefore worthwhile to treat them separately and to model their temporal profiles with time-series analysis for possible forecasting. Here I perform time-series analysis and fit an AR(4) model to the usage at a specific station.
Prepare the observed usage at a specific station.
s <- stations_type1_names[1]
sid <- dplyr::filter(stations, STATION_NAME == s) %>%
select(STATION_ID) %>% as.numeric()
print(s)
## [1] "1 Ave & E 15 St"
# Month of interest
m <- 'Jan'
# station-date-hour-trips data.frame
# Note: some hour points have no activity at all, so they must be filled in as zero
pick_s_days_avail <- dplyr::filter(data,
start.station.name %in% s,
month(startTimestamp, label=T) == m) %>%
mutate(DATE=date(startTimestamp)) %>%
group_by(start.station.name, DATE, startHr) %>%
summarise(TRIPS=n()) %>% as.data.frame()
colnames(pick_s_days_avail) <- c('STATION_NAME', 'DATE', 'HOUR', 'TRIPS')
pick_s_days_grid <- expand.grid(STATION_NAME=s,
DATE=unique(pick_s_days_avail$DATE),
HOUR=sprintf("%02d", 0:23),
TRIPS=0)
pick_s_days <- bind_rows(pick_s_days_avail, pick_s_days_grid) %>%
group_by(STATION_NAME, DATE, HOUR) %>%
summarise(NTRIPS=sum(TRIPS)) %>%
mutate(TIMESTAMP=ymd_h(paste(DATE, HOUR), tz='EST5EDT'),
YEAR=year(DATE))
obs_trips <- as.numeric(pick_s_days$NTRIPS)
ar4 <- ar(obs_trips, aic = FALSE, order.max = 4)
fit_trips <- fitted(ar4)
fit_trips[seq_len(4)] <- obs_trips[seq_len(4)]
pick_s_days_ar <- ungroup(pick_s_days) %>% dplyr::select(DATE, HOUR, YEAR) %>%
mutate(OBS=obs_trips, FIT=fit_trips) %>%
melt(id.vars=c('DATE', 'HOUR', 'YEAR'), value.name='NTRIPS',
variable.name='Type')
p_pick_s_ar <- ggplot(pick_s_days_ar, aes(x=DATE, y=HOUR, fill=NTRIPS)) +
geom_tile(color = "white", size = 0.4) +
scale_fill_gradient(low="ghostwhite", high="red") +
scale_x_date(date_breaks="1 week", date_labels="%d-%b-%y") +
facet_grid(YEAR~Type) +
xlab('') + ylab('Hour') + labs(fill='Trips') +
ggtitle(paste0('Observed and AR(4) fitted picking activity at ', s))
print(p_pick_s_ar)
With the AR(4) model, the temporal profile of the specific station is captured. If needed, the CitiBike company can forecast future usage with this model.
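The forecasting step mentioned above can be sketched with `predict()` on an `ar` fit. Here simulated hourly counts with a 24-hour cycle stand in for the real station data (an illustration, not the project's series):

```r
set.seed(1)
# Simulated hourly trip counts with a daily cycle, standing in for real data
obs_trips <- pmax(0, round(10 + 5 * sin(2 * pi * (1:240) / 24) + rnorm(240)))
ar4 <- ar(obs_trips, aic = FALSE, order.max = 4)  # AR(4), order fixed as in the text
fc <- predict(ar4, n.ahead = 24)                  # forecast the next 24 hours
fc$pred                                           # point forecasts
fc$se                                             # forecast standard errors
```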
A directed weighted graph underlies the raw data set: each station is a node, each trip is a directed edge, and the number of trips between two stations is the edge weight.
Therefore, it is worth running a community detection algorithm with the following two goals:
The algorithm used here is Infomap, which is built into the igraph package. The betweenness-based method is not used because it is much more time-consuming.
net0_edges_wt <- group_by(data, start.station.name, end.station.name) %>%
summarise(TRIPS=n()) %>%
ungroup() %>% as.data.frame()
colnames(net0_edges_wt) <- c('from', 'to', 'weight')
stations <- dplyr::select(stations, STATION_NAME, STATION_ID,
STATION_LAT, STATION_LON)
net0 <- graph_from_data_frame(d=net0_edges_wt,
directed=T,
vertices=stations)
E(net0)$width <- E(net0)$weight / 10
E(net0)$arrow.size <- .2
E(net0)$edge.color <- "gray80"
community_detect_stations <- function(net, nodes_geo,
method=c('infomap', 'betweenness')){
if (method == 'infomap'){
imc <- cluster_infomap(net)
memb <- membership(imc)
} else if (method == 'betweenness') {
ebc <- cluster_edge_betweenness(net)
memb <- membership(ebc)
} else {
stop('Unknown method for community detection')
}
stations_community <- data.frame(STATION_NAME=names(memb),
COMMUNITY=factor(memb))
p_community <- left_join(x=stations_community, y=nodes_geo,
by=c('STATION_NAME')) %>%
qmplot(data = ., x=STATION_LON, y=STATION_LAT,
maptype = 'toner-lite',
extent = 'device',
zoom = 14,
color=COMMUNITY, shape=COMMUNITY, size=I(2.5))
}
p_community_infomap_nyc <- community_detect_stations(net=net0,
nodes_geo=stations,
method='infomap')
print(p_community_infomap_nyc)
It is not surprising that, based on the full data set, New York City is partitioned into Manhattan and Brooklyn communities.
Recall that Type-A stations are home-like and Type-B stations are company-like.
net_stations_typeAnB <- induced.subgraph(net0, vids=c(stations_type1_names,
stations_type2_names))
p_community_infomap_typeAnB <- community_detect_stations(net=net_stations_typeAnB,
nodes_geo=stations,
method='infomap')
print(p_community_infomap_typeAnB)
In Manhattan, there are two big communities, roughly separated by 23rd Street. That is to say, for example, Type-A stations around Penn Station are more likely to reach destinations on the East Side rather than downtown, while Type-A stations around Avenue A & E 10th Street are more likely to head toward the Wall Street area.
There are two layers of evaluation: whether the findings themselves can be verified, and whether the methods are appropriate.
For the first issue, the answer is no. This project is mostly data-driven: although it starts from the hypothesis that station activity reflects hidden patterns of human preference, the entire analysis is exploratory, and there is no gold standard for judging whether the findings are true. The usual argument about evaluating unsupervised machine learning applies here. However, exploratory data analysis does not mean there is no way to evaluate whether a method is appropriate, so the second issue is answered yes: several sections throughout the project monitor performance.
For example, to infer the appropriate number of clusters when performing K-means on the 24-hour P/D index matrix, I used the silhouette metric implemented in the NbClust package.
pd_station_K <- NbClust(data = pd_station_24hr,
min.nc=3, max.nc=20, method='kmeans', index='silhouette')
plot_nbclust_obj <- function(nbclust_obj){
nbclust_stats_k <- as.numeric(names(nbclust_obj$All.index))
nbclust_stats_val <- as.numeric(nbclust_obj$All.index)
nbclust_stats_val[is.na(nbclust_stats_val)] <- 0
plot(x=nbclust_stats_k, y=nbclust_stats_val, type="b",
xlab="Number of possible clusters",
ylab='silhouette',
main="Infer the best number of clusters")
}
plot_nbclust_obj(pd_station_K)
Because I manually inspect the clustered heat-map to pick the groups defining Type-A, Type-B, and Type-C stations, it is acceptable to set K to 10 at first. The resulting three types are consistent with automatically setting K to 3, as inferred in this model-evaluation part.
save.and.print <- function (plot.func, file.name){
  p <- plot.func()
  pdf(file.path(FIGDIR, file.name), 10, 10)
  print(p)
  dev.off()
  print(p)  # also show the plot in the report
}
print(summary(trip.data))
## VendorID lpep_pickup_datetime Lpep_dropoff_datetime
## Min. :1.000 Length:1404726 Length:1404726
## 1st Qu.:2.000 Class :character Class :character
## Median :2.000 Mode :character Mode :character
## Mean :1.795
## 3rd Qu.:2.000
## Max. :2.000
##
## Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude
## Length:1404726 Min. : 1.000 Min. :-75.92 Min. : 0.00
## Class :character 1st Qu.: 1.000 1st Qu.:-73.96 1st Qu.:40.69
## Mode :character Median : 1.000 Median :-73.95 Median :40.75
## Mean : 1.092 Mean :-73.83 Mean :40.69
## 3rd Qu.: 1.000 3rd Qu.:-73.92 3rd Qu.:40.80
## Max. :99.000 Max. : 0.00 Max. :42.32
##
## Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance
## Min. :-75.92 Min. : 0.00 Min. :0.000 Min. : 0.000
## 1st Qu.:-73.97 1st Qu.:40.70 1st Qu.:1.000 1st Qu.: 1.070
## Median :-73.95 Median :40.75 Median :1.000 Median : 1.900
## Mean :-73.85 Mean :40.70 Mean :1.359 Mean : 2.879
## 3rd Qu.:-73.91 3rd Qu.:40.79 3rd Qu.:1.000 3rd Qu.: 3.600
## Max. : 0.00 Max. :42.32 Max. :9.000 Max. :268.190
##
## Fare_amount Extra MTA_tax Tip_amount
## Min. :-499.00 Min. :-4.5000 Min. :-0.5000 Min. :-13.200
## 1st Qu.: 6.50 1st Qu.: 0.0000 1st Qu.: 0.5000 1st Qu.: 0.000
## Median : 9.50 Median : 0.5000 Median : 0.5000 Median : 0.000
## Mean : 12.51 Mean : 0.3502 Mean : 0.4868 Mean : 1.307
## 3rd Qu.: 15.00 3rd Qu.: 0.5000 3rd Qu.: 0.5000 3rd Qu.: 2.000
## Max. :3347.50 Max. : 4.5000 Max. : 0.5000 Max. :300.080
##
## Tolls_amount Ehail_fee improvement_surcharge Total_amount
## Min. :-5.5400 Mode:logical Min. :-0.3000 Min. :-499.00
## 1st Qu.: 0.0000 NA's:1404726 1st Qu.: 0.3000 1st Qu.: 8.19
## Median : 0.0000 Median : 0.3000 Median : 11.76
## Mean : 0.1191 Mean : 0.2921 Mean : 15.09
## 3rd Qu.: 0.0000 3rd Qu.: 0.3000 3rd Qu.: 18.30
## Max. :98.0000 Max. : 0.3000 Max. :3349.30
##
## Payment_type Trip_type
## Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :1.000
## Mean :1.515 Mean :1.021
## 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :5.000 Max. :2.000
## NA's :2
Download map data with coordinates auto-adjusted to the NYC Taxi data.
# Remove outliers using boxplot.stats fences
remove.outliers <- function(x) x[!x %in% boxplot.stats(x)$out]
# Find min and max at the same time
min.and.max <- function(x) c(min(x), max(x))
# Find boundary
longitude.summary <- min.and.max(c(
min.and.max(remove.outliers(trip.data$Pickup_longitude)),
min.and.max(remove.outliers(trip.data$Dropoff_longitude))
))
latitude.summary <- min.and.max(c(
min.and.max(remove.outliers(trip.data$Pickup_latitude)),
min.and.max(remove.outliers(trip.data$Dropoff_latitude))
))
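As a quick illustration of the `boxplot.stats`-based trimming above (toy vector, not trip data): values beyond 1.5 IQR of the quartile fences are dropped, which is what keeps the map bounding box tight around the city.

```r
remove.outliers <- function(x) x[!x %in% boxplot.stats(x)$out]
x <- c(rep(1:5, each = 10), 100)   # 100 lies far outside the 1.5*IQR fences
trimmed <- remove.outliers(x)
range(trimmed)                     # 1 to 5; the outlier is gone
```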
map <- ggmap::get_stamenmap(bbox = c(left = longitude.summary[1],
bottom=latitude.summary[1],
right= longitude.summary[2],
top = latitude.summary[2]
),
zoom = 12,
maptype = "toner-lite")
Loading the whole data set would be too heavy for finding an optimum value for k-means, so a sample is drawn first.
sample.size <- 40000
sample.row.nums <- sample.int(length(row.names((trip.data))), sample.size)
trip.data.sample <- trip.data[sample.row.nums,]
trip.data.sample <- trip.data.sample[trip.data.sample$Pickup_longitude<0,]
trip.data.sample <- trip.data.sample[trip.data.sample$Pickup_latitude >0,]
trip.data.sample <- trip.data.sample[trip.data.sample$Dropoff_longitude<0,]
trip.data.sample <- trip.data.sample[trip.data.sample$Dropoff_latitude >0,]
taxi.sample.size <- length(row.names(trip.data.sample))
print(paste("Sample Size:", taxi.sample.size))
## [1] "Sample Size: 39911"
Red dots mark drop-offs; blue dots mark pickups.
Taxi.Pickup.And.Dropoff.with.Sample.Data <- function(){
ggmap::ggmap(map, main = "Taxi Pickup And Dropoff with Sample Data") +
ggplot2::geom_point(ggplot2::aes(x = Pickup_longitude, y = Pickup_latitude),
data = trip.data.sample,
shape = ".", color = "blue")+
ggplot2::geom_point(ggplot2::aes(x = Dropoff_longitude, y = Dropoff_latitude),
data = trip.data.sample,
shape = ".", color = "red")
}
save.and.print(
Taxi.Pickup.And.Dropoff.with.Sample.Data,
'taxi_pickup_and_dropoff_sample.pdf')
First, just pick k = 100 to see a sample result.
t.k.s.d <- kmeans(trip.data.sample[,8:9],100, iter.max=100)
t.k.s.d$cluster <- as.factor(t.k.s.d$cluster)
Taxi.Cluster.Sample.with.k.100 <- function(){
ggmap::ggmap(map, main = "Taxi Cluster Sample with k=100") +
ggplot2::geom_point(data = trip.data.sample, shape = ".",
ggplot2::aes(Dropoff_longitude, Dropoff_latitude, color = t.k.s.d$cluster)
)+
ggplot2::geom_point(data = as.data.frame(t.k.s.d$centers), shape = 21,
ggplot2::aes(Dropoff_longitude, Dropoff_latitude,
fill = as.factor(row.names(t.k.s.d$centers)),
size = as.numeric(t.k.s.d$size))
)+ theme(legend.position="none")
}
save.and.print(
  Taxi.Cluster.Sample.with.k.100,
  "taxi_cluster_sample_with_k=100.pdf")
Now let's find an optimum value of k using the elbow method.
This is where computation gets time-consuming, so I run it on the sample data set instead of the whole one.
taxi.k.numbers <- 1:80
taxi.collect.kmean.statistics <- function (k) {
kmean.result <- kmeans(trip.data.sample[,8:9],centers = k, iter.max = max(100,k*2), algorithm="MacQueen")
c( kmean.result$tot.withinss, kmean.result$betweenss )
}
taxi.k.means.statitcs <- plyr::laply(taxi.k.numbers,
taxi.collect.kmean.statistics,
.progress= plyr::progress_text(char = '+')
)
## |+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
# Set label
taxi.k.means.statitcs <- as.data.frame(taxi.k.means.statitcs)
colnames(taxi.k.means.statitcs) <- c("Withiness","Betweeness")
row.names(taxi.k.means.statitcs) <- taxi.k.numbers
# Transform to mean value instead of sum
taxi.k.means.statitcs$Withiness <- taxi.k.means.statitcs$Withiness/taxi.sample.size
taxi.k.means.statitcs$Betweeness <- taxi.k.means.statitcs$Betweeness/taxi.sample.size
Taxi.Withingness.And.Betweeness.by.different.K <- function(){
  k <- as.numeric(row.names(taxi.k.means.statitcs))
  # ggplot() silently ignores a `main` argument; set the title with ggtitle()
  ggplot2::ggplot(taxi.k.means.statitcs) +
    ggplot2::ggtitle("Taxi Withingness And Betweeness by different K") +
    ggplot2::geom_line(ggplot2::aes(x = k, y = Withiness), color = "blue") +
    ggplot2::geom_line(ggplot2::aes(x = k, y = Betweeness), color = "red")
}
save.and.print(
Taxi.Withingness.And.Betweeness.by.different.K,
"Taxi.Withingness.And.Betweeness.by.different.K.pdf"
)
Red for Betweeness and Blue for Withiness.
Taxi.Withingness.And.Betweeness.by.different.K.Closer <- function(){
  closer <- taxi.k.means.statitcs[3:30,]
  k <- as.numeric(row.names(closer))
  # ggplot() silently ignores a `main` argument; set the title with ggtitle()
  ggplot2::ggplot(closer) +
    ggplot2::ggtitle("Taxi Withingness And Betweeness by different K - Closer") +
    ggplot2::geom_line(ggplot2::aes(x = k, y = Withiness), color = "blue") +
    ggplot2::geom_line(ggplot2::aes(x = k, y = Betweeness), color = "red")
}
save.and.print(
Taxi.Withingness.And.Betweeness.by.different.K.Closer,
"Taxi.Withingness.And.Betweeness.by.different.K.closer.pdf"
)
Red is Betweeness and blue is Withiness. Based on the elbow, we pick k = 7.
k=7
t.k.s.d <- kmeans(trip.data[,8:9],k, iter.max=200, algorithm="MacQueen")
t.k.s.d$cluster <- as.factor(t.k.s.d$cluster)
Taxi.Need.Clustering <- function(){
ggmap::ggmap(map, main = "Taxi Need Clustering") +
ggplot2::geom_point(data = trip.data, shape = ".",
ggplot2::aes(Dropoff_longitude, Dropoff_latitude, color = as.factor(t.k.s.d$cluster))
)+
ggplot2::geom_point(data = as.data.frame(t.k.s.d$centers), shape = 21,
ggplot2::aes(Dropoff_longitude, Dropoff_latitude,
fill = as.factor(row.names(t.k.s.d$centers)),
size = as.numeric(t.k.s.d$size))
) + theme(legend.position="none")
}
save.and.print(Taxi.Need.Clustering, "Taxi.Need.Clustering.pdf")
From the graph we can see that the clustering seems meaningful, but it is not suitable for predicting potential CitiBike stations: it is too coarse. So what do we do?
I looked back and asked: what made CitiBike place its stations the way it did?
print(p_all_stations)
Specifically, why are the stations placed so close to each other? The answer is simple: people walk to nearby stations to get bikes, so the acceptable distance is very limited. Therefore, the standard the taxi data should meet when choosing k can be inferred from the distances between existing stations.
I developed a metric for measuring the distance between stations: the Mean squared Closest Distance (MCD).
# MCD
MCD <- function(points){
# Calculate Distance Matrix
distance.matrix <- function(points, func){
as.matrix(apply(points, 1, function(p1) apply(points, 1, function(p2) func(p1, p2))))
}
  point.distances <- distance.matrix(points, func =
    function(x, y) (x[1]-y[1])^2 + (x[2]-y[2])^2
  )
# Find Nearest Distance
min.no <- function(A) min(A[A>0])
point.closest.distance <- apply(point.distances, 1, min.no)
# Mean
mean.point.closest.distance <- mean(point.closest.distance)
mean.point.closest.distance
}
mean.station.closest.distance <- MCD(stations[,c('STATION_LON','STATION_LAT')])
print(paste("MCD of Stations:", mean.station.closest.distance))
## [1] "MCD of Stations: 5.56514561189345e-06"
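A quick check of the MCD definition on a toy configuration (not station data): for three collinear points one unit apart, each point's nearest squared distance is 1, so the MCD is 1.

```r
# MCD as defined above: mean over points of the squared distance to the nearest neighbor
MCD <- function(points){
  distance.matrix <- function(points, func){
    as.matrix(apply(points, 1, function(p1) apply(points, 1, function(p2) func(p1, p2))))
  }
  point.distances <- distance.matrix(points, func =
    function(x, y) (x[1]-y[1])^2 + (x[2]-y[2])^2
  )
  min.no <- function(A) min(A[A > 0])   # nearest positive (non-self) distance
  mean(apply(point.distances, 1, min.no))
}
toy <- data.frame(lon = c(0, 1, 2), lat = c(0, 0, 0))
MCD(toy)  # 1
```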
taxi.max.k <- 500
taxi.collect.kmean.statistics.MCD <- function (k) {
kmeans.result <- kmeans(trip.data.sample[,8:9],centers = k, iter.max = max(100,k*2), algorithm="MacQueen")
MCD(kmeans.result$centers)
}
taxi.k.means.statitcs.MCD <- plyr::laply(1:taxi.max.k,
taxi.collect.kmean.statistics.MCD,
.progress= plyr::progress_text(char = '+')
)
##
## |+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
taxi.k.means.statitcs.MCD<- data.frame(taxi.k.means.statitcs.MCD)
taxi.k.means.statitcs.MCD$k <- 1:taxi.max.k
colnames(taxi.k.means.statitcs.MCD) <- c("MCD","k")
Taxi.MCD.by.different.K <- function(){
  k <- as.numeric(row.names(taxi.k.means.statitcs.MCD))
  # ggplot() silently ignores a `main` argument; set the title with ggtitle()
  ggplot2::ggplot(taxi.k.means.statitcs.MCD) +
    ggplot2::ggtitle("Taxi MCD by different K") +
    ggplot2::geom_line(ggplot2::aes(x = k, y = MCD))
}
save.and.print(
Taxi.MCD.by.different.K,
"Taxi.MCD.by.different.K.pdf"
)
taxi.k.means.statitcs.MCD[taxi.k.means.statitcs.MCD$MCD<mean.station.closest.distance,]
## [1] MCD k
## <0 rows> (or 0-length row.names)
Even at k = 500, the MCD of the cluster centers never drops below the MCD of the existing stations, so the proposed centers remain sparser than the real network. Below, k = 300, on the order of the number of existing stations, is used to propose candidate locations.
k=300
all.stations.should.kmeans <- kmeans(trip.data[,8:9],k, iter.max=200, algorithm="MacQueen")
all.stations.should.kmeans$cluster <- as.factor(all.stations.should.kmeans$cluster)
Stations.should.be.here <- function(){
ggmap::ggmap(map, main = "Stations should be here") +
ggplot2::geom_point(data = trip.data, shape = ".",
ggplot2::aes(Dropoff_longitude, Dropoff_latitude, color = as.factor(all.stations.should.kmeans$cluster))
)+
ggplot2::geom_point(data = as.data.frame(all.stations.should.kmeans$centers), shape = 21,
ggplot2::aes(Dropoff_longitude, Dropoff_latitude,
fill = as.factor(row.names(all.stations.should.kmeans$centers)),
size = as.numeric(all.stations.should.kmeans$size))
) + theme(legend.position="none")
}
save.and.print(Stations.should.be.here, "Stations.should.be.here.pdf")
Human activities cannot be observed directly, but the picking and docking activities at stations provide indirect yet robust data for investigating the hidden patterns. First, the data visualization part suggests there are hidden patterns in how people use CitiBike. Second, the hypothesis is supported by the temporal profiles of about 300 stations. The temporal profiles are based on the P/D index, a novel metric that merges picking and docking information and provides a straightforward basis for analysis. In addition, we modeled the “rush-hour” properties of a specific station as an example, suggesting a time schedule for the CitiBike company to manually rebalance bikes at that station. Finally, we identified communities of stations with the Infomap algorithm; the communities are roughly consistent with the administrative regions of New York City, which suggests our findings are robust. New CitiBike stations should be added, since there is substantial demand and only part of it is currently fulfilled.
Future work is to find whether there are stations whose temporal profile changes after some year. For example, station X is a Type-A (home-like) station during 2013-2015, but from 2016 its temporal profile suddenly changes to Type-C, i.e. stable throughout the day. That would suggest either that the neighborhood changed entirely or that a major transportation hub changed. The point is that, without directly observing the hidden patterns of human activities, we still have an indirect approach to capture such changes. A priority order for new stations can also be worked out; this requires a measure of revenue, which becomes feasible once income estimates for each station are available.
Yun Yan
Willian Zhang
All work described in the project proposal has been finished, including the optional part (network analysis). On presentation day, only the first part (data visualization) and the second part (subgroups of stations) were introduced; the third and final parts (time-series analysis and community detection) were not, because we ran out of time. The same content can be found in the submitted presentation slide file.
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12.1 (Sierra)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] NbClust_3.0 igraph_1.0.1 ggmap_2.6.1
## [4] pheatmap_1.0.8 forecast_7.2 timeDate_3012.100
## [7] zoo_1.7-13 cowplot_0.7.0 scales_0.4.0
## [10] readr_1.0.0 lubridate_1.6.0 ggplot2_2.1.0
## [13] reshape2_1.4.1 dplyr_0.5.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.7 formatR_1.4 RColorBrewer_1.1-2
## [4] plyr_1.8.4 tseries_0.10-35 tools_3.3.1
## [7] digest_0.6.10 evaluate_0.9 tibble_1.2
## [10] gtable_0.2.0 lattice_0.20-33 png_0.1-7
## [13] DBI_0.5-1 mapproj_1.2-4 yaml_2.1.13
## [16] parallel_3.3.1 proto_0.3-10 stringr_1.1.0
## [19] knitr_1.14 maps_3.1.1 RgoogleMaps_1.4.1
## [22] grid_3.3.1 nnet_7.3-12 R6_2.2.0
## [25] jpeg_0.1-8 rmarkdown_1.1 sp_1.2-3
## [28] magrittr_1.5 MASS_7.3-45 htmltools_0.3.5
## [31] assertthat_0.1 geosphere_1.5-5 colorspace_1.2-6
## [34] fracdiff_1.4-2 labeling_0.3 quadprog_1.5-5
## [37] stringi_1.1.1 lazyeval_0.2.0 munsell_0.4.3
## [40] rjson_0.2.15